This notebook cleans the data and then fits a regression tree and a random forest on the cleaned features.

Step 0: Load the raw data, load the age-9 features, and extract the age-9 data

Step 1: Clean the data

categorical=colnames(ED.categorical)
ED.factor=clean.factor(ED.continuous)
NAs introduced by coercion (warning repeated many times)
ED.continuous=ED.continuous[,!colnames(ED.continuous) %in% colnames(ED.factor)]
ED.continuous=clean.continuous(ED.continuous)
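# add the missingness-indicator columns (names containing "isna") to the categorical set;
# value=TRUE returns the matching names directly, even when only one column matches
categorical=c(categorical,grep("isna",colnames(ED.factor),value=TRUE),grep("isna",colnames(ED.continuous),value=TRUE))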
#combine the data
ED.final<-cbind(ED.continuous,ED.categorical,ED.factor)
ED.final=as.data.frame(ED.final)
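
The helpers `clean.factor` and `clean.continuous` are sourced from a file that is not shown here, so their exact behavior is unknown. The coercion warning and the `isna` column names suggest that columns are converted to numeric and that a missingness flag is kept for each affected column. A minimal sketch of that idea follows, assuming a hypothetical helper `add.isna.flags` and a hypothetical `_isna` suffix (neither is taken from the notebook):

```r
# Hypothetical helper, NOT the notebook's clean.continuous():
# coerce every column to numeric and add a 0/1 flag column for values that became NA.
add.isna.flags <- function(df) {
  out <- list()
  for (nm in names(df)) {
    # as.numeric() on non-numeric strings yields NA with a coercion warning
    x <- suppressWarnings(as.numeric(as.character(df[[nm]])))
    out[[nm]] <- x
    if (anyNA(x)) {
      out[[paste0(nm, "_isna")]] <- as.integer(is.na(x))
    }
  }
  as.data.frame(out)
}

# Toy data: column "a" has one entry that cannot be parsed as a number
toy <- data.frame(a = c("1", "x", "3"), b = c("2", "4", "6"),
                  stringsAsFactors = FALSE)
flagged <- add.isna.flags(toy)

# Collect the indicator column names, as the notebook does for `categorical`
isna.cols <- grep("isna", colnames(flagged), value = TRUE)
```

Keeping the flag as its own column lets a later model distinguish "missing" from an imputed value.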

Step 2: Handle missing data
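
No code is displayed for this step. As an illustration only, the sketch below fills missing numeric values with the column median. This is a deliberately simple stand-in technique, not necessarily what the notebook ran, and `impute.median` is a hypothetical name:

```r
# Hypothetical helper: replace NA in each numeric column with that column's median.
# A simple stand-in for illustration; the notebook's own imputation is not shown.
impute.median <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.numeric(x) && anyNA(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
      df[[nm]] <- x
    }
  }
  df
}

toy <- data.frame(x = c(1, NA, 3), y = c(NA, 2, 4))
filled <- impute.median(toy)
```

Median imputation ignores relationships between columns, so a model-based method may be preferable when many values are missing.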

Step 3: Combine features with labels

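# read the list of selected feature codes
Bdf=read.csv("../data/features.csv")
# as.character() ensures columns are selected by name even if Codes was read as a factor
Bcodes=as.character(Bdf$Codes)
data.filtered= data.train[,Bcodes]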
library(e1071)

Attaching package: ‘e1071’

The following object is masked from ‘package:Hmisc’:

    impute
library(tree)
library(caret)

Attaching package: ‘caret’

The following object is masked from ‘package:survival’:

    cluster
library(rpart)
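# set.seed(1)  # left disabled, so the split and the results below change between runs
# random 90/10 train/test split
i.train = sample(seq_len(nrow(data.filtered)), floor(nrow(data.filtered)*0.9))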
dtrain=data.filtered[i.train,]
dtest=data.filtered[-i.train,]
ltrain=label$gpa[i.train]
ltest=label$gpa[-i.train]
dt=cbind(ltrain,dtrain)
dt=as.data.frame(dt)
tree.ff=rpart(ltrain~.,dt,method="anova")
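# CP value of the cptable row with the lowest cross-validated error (xerror)
min.xerror <- tree.ff$cptable[which.min(tree.ff$cptable[,"xerror"]),"CP"]
min.xerror # varies between runs because no seed is set
[1] 0.01233282
best.tree=prune(tree.ff,cp = min.xerror)
# predictions below come from the unpruned tree.ff; pass best.tree to evaluate the pruned tree
tree.predict=predict(tree.ff,newdata = dtest)
# test-set mean squared error
error <- mean((tree.predict-ltest)^2)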
cbind(tree.predict,ltest)
     tree.predict ltest
61       2.980769  4.00
64       3.433962  3.75
72       3.433962  3.25
165      2.668942  1.50
173      2.668942  3.25
182      2.668942  2.50
196      3.433962  2.75
251      2.668942  3.00
403      2.668942  3.00
426      2.668942  1.75
429      2.668942  2.75
497      3.433962  3.25
499      2.980769  3.00
559      2.668942  4.00
572      2.668942  2.75
624      3.433962  3.75
653      2.668942  1.50
664      2.980769  3.75
680      2.980769  3.25
714      2.668942  2.25
806      2.668942  3.50
808      2.916216  3.50
834      3.433962  3.25
912      2.668942  2.75
956      2.668942  3.75
983      2.916216  3.75
1002     2.916216  2.75
1018     2.668942  3.75
1157     2.916216  2.75
1211     2.668942  2.25
1258     2.668942  3.00
1280     2.668942  3.00
1307     2.668942  2.00
1375     2.916216  3.25
1383     2.980769  2.00
1389     2.668942  1.75
1555     2.668942  1.50
1620     2.980769  4.00
1654     2.668942  3.00
1667     3.098485  3.25
1783     3.331250  2.00
1790     2.916216  2.75
1806     2.668942  2.00
1812     3.433962  3.25
1825     2.980769  2.25
1992     3.331250  2.75
2056     3.433962  3.75
2092     2.980769  3.75
2174     2.668942  2.50
2185     2.668942  3.00
2186     2.668942  1.75
2203     2.916216  3.00
2236     2.916216  3.25
2265     2.916216  2.50
2326     2.668942  4.00
2346     2.668942  2.25
2358     2.668942  1.75
2370     2.668942  3.00
2372     2.668942  2.25
2376     2.668942  3.50
2403     2.668942  2.75
2430     2.916216  3.75
2468     2.668942  2.50
2483     2.668942  3.25
2497     2.668942  3.75
2516     2.668942  3.25
2572     2.668942  4.00
2590     2.916216  2.75
2611     2.668942  2.25
2655     2.668942  2.50
2658     2.916216  4.00
2668     2.668942  2.75
2719     2.916216  3.50
2775     2.916216  2.50
2796     2.668942  2.75
2800     2.668942  2.50
2812     2.668942  2.75
2867     2.668942  1.50
2877     3.331250  3.50
2933     2.668942  3.25
2958     2.980769  2.00
3012     2.668942  2.25
3023     3.331250  4.00
3039     2.668942  2.00
3042     2.668942  2.50
3052     3.098485  2.50
3107     2.668942  1.50
3237     2.668942  2.25
3254     2.668942  2.50
3263     2.916216  2.25
3279     3.433962  4.00
3347     3.098485  4.00
3355     2.980769  2.75
3382     2.668942  2.50
3394     2.668942  2.50
3395     3.331250  3.75
3422     2.668942  2.25
3481     2.668942  2.75
3484     2.668942  4.00
3544     2.668942  3.75
3567     2.668942  3.50
3634     2.916216  3.25
3643     2.668942  3.00
3658     2.668942  2.75
3681     3.433962  3.50
3700     2.668942  3.25
3701     3.433962  3.00
3734     2.668942  3.25
3782     2.668942  2.50
3802     2.916216  3.50
3852     2.980769  3.00
3920     2.668942  3.00
3934     3.433962  3.25
4002     2.916216  2.25
4028     2.980769  3.25
4155     2.668942  2.25
4224     2.916216  3.50
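
The tree code above prunes the tree into `best.tree` but then predicts with the unpruned `tree.ff`. The sketch below shows the same workflow evaluated on the pruned tree. It uses the built-in `mtcars` data as a stand-in (an assumption, since the notebook's `data.filtered` and `label` objects are not available here), and the variable names are illustrative:

```r
library(rpart)

set.seed(1)  # fixed seed so the split is reproducible

# 90/10 train/test split on stand-in data (mtcars, predicting mpg)
idx   <- sample(seq_len(nrow(mtcars)), floor(0.9 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Fit a regression tree; rpart cross-validates by default (xval = 10)
fit <- rpart(mpg ~ ., data = train, method = "anova")

# Pick the complexity parameter with the lowest cross-validated error
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# Prune, then predict with the PRUNED tree
pruned <- prune(fit, cp = best.cp)
pred   <- predict(pruned, newdata = test)

# Test-set mean squared error
test.mse <- mean((pred - test$mpg)^2)
```

Comparing `test.mse` for `fit` and `pruned` is one way to check whether pruning helps on held-out data.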

Random Forest

source("~/GitHub/spr2017-proj3-group7/lib/train.R")

model.best<-Train(data.filtered,factor(label.train$gpa))
model.best$errors